Abstract

This paper addresses the problem of predicting the k events that are most
likely to occur next, over historical real-time event streams. Existing
approaches to causal prediction queries have a number of limitations. First,
they exhaustively search over an acyclic causal network to find the most
likely k effect events; however, data from real event streams frequently reflect
cyclic causality. Second, they contain conservative assumptions intended to
exclude all possible non-causal links in the causal network; this leads to the
omission of many less-frequent but important causal links. We overcome
these limitations by proposing a novel event precedence model and a run-time
causal inference mechanism. The event precedence model constructs a
first order absorbing Markov chain incrementally over event streams, where
an edge between two events signifies a temporal precedence relationship
between them, which is a necessary condition for causality. Then, the run-time
causal inference mechanism learns causal relationships dynamically during
query processing. This is done by removing some of the temporal precedence
relationships that do not exhibit causality in the presence of other events in
the event precedence model. This paper presents two query processing
algorithms: one performs exhaustive search on the model and the other performs
a more efficient reduced search with early termination. Experiments using
two real datasets (cascading blackouts in power systems and web page views)
verify the effectiveness of the probabilistic top-k prediction queries and the
efficiency of the algorithms. Specifically, the reduced search algorithm
reduced runtime, relative to exhaustive search, by 25-80% (depending on the
application) with only a small reduction in accuracy.

Keywords: top-k query; event stream; causal network; prediction. 
1 Introduction


Causal prediction (e.g., [14, 15, 39]) is emerging as an essential field for real-time
monitoring, planning and decision support in diverse applications such as stock
market analysis, electric power grid monitoring, sensor network monitoring,
network intrusion detection, and web click-stream analysis. There is a need for active
systems to continuously monitor the event streams from these applications to allow
for the prediction of future effect events in real time. Specifically, given a sequence
of potentially causal events, many applications would benefit from good algorithms
to predict the next most likely (namely, top k) effect events. The potentially huge
answer space, however, and the unknown dynamics as well as the streaming nature
of data make such top-k prediction a challenging task.

Consider the following two scenarios as motivating examples. 

Example 1 Web page click stream: Consider web-based online systems. A
majority of them display the same content for everyone. However, the user experience
can be more productive with a dynamic system where content is displayed based
on real-time prediction of users' most likely activities, given historical data. One
can use the results (i.e., the web pages/links most likely to be visited next) to
display the most relevant links, content, and advertisements at each step of the user
activity. Such an arrangement may help to retain the user longer by displaying the
most relevant information, thereby increasing the content consumption (e.g., sales,
page visits, ad clicks). □

Example 2 Electric power grid: Consider an electric power grid. When
components of a power grid fail, as a result of a storm, malfunction or cyber-attack, a
cascading sequence of subsequent component failures may result, which may lead
to very large blackouts (e.g., [43]). Thus, a timely prediction of the components
that are most likely to fail next, given a list of a few components that have failed,
may enable operators to take mitigating actions (like shutting down sections of the
power grid) before a large-scale blackout occurs. Cascading blackouts typically
progress slowly (minutes to tens of minutes) in the initial stages; a delay of a few
seconds to compute and implement emergency controls is generally acceptable. □

In this paper, we address the challenge of continuous prediction of the top-k
most probable next effects in real time streams. To the best of our knowledge,
there are no existing top-k query processing mechanisms that are sufficiently
efficient to support time-critical applications, such as Examples 1 and 2. Moreover,
most previous work on the prediction of effect events given one or more cause
events is based on inefficient exhaustive search over the large search space of a causal
network (e.g., ||2l|43). A causal network represents the cause and effect
relationships, called causality, in a directed acyclic graph. This traditional causal
network model (e.g., ifTUl [TTl im |28l [33l |39l 113) has two major limitations when
used for causal prediction. First, since it is acyclic, it cannot have loops, such as
A → B → C → A, or bidirectional relationships, such as A ↔ B, and
consequently does not support cyclic causality (e.g., ifT^l^ l). The event streams from
many applications, however, do show cyclic relationships. For example, a
visitor to a news web site may visit the home page, proceed to read an article, and
then return to the home page, creating a cyclic relationship between these vertices
in the graph. Second, the causal Markov condition, often considered an essential
property of traditional causal networks, is conservative in the causal inference, and
as a result removes many infrequent but important causal relationships from the
causal network. That is, the causal Markov condition calls for the
removal of those relationships which could potentially be independent in the
presence of one or more events. The rationale for this is to avoid any suspicious and
weak relationships. However, this approach often backfires by removing rare but
important causal relationships [34]. We call this limitation the causal information
loss.

Based on these facts, we identify three central research problems - (1) how to 
model causal relationships among events in a stream to prevent causal information 
loss; (2) how to address cyclic causality in the causal model; (3) how to efficiently 
run a causal inference query on this causal model to continuously predict the top-k 
probable effects. 

To address these problems, first, we propose an event precedence model that
captures the temporal precedence relationship between every two event types in a
first order absorbing Markov chain. We refer to the resulting model as an event
precedence network (EPN), in which an edge signifies the temporal relationship
between two events. This inclusion of all temporal precedence (hence likely
causal) relationships helps to avoid causal information loss. Note that the EPN is
a generative model of the observed event stream, which is built over a set of
predefined event types instead of event instances. Second, we propose a run-time causal
inference method. Due to cyclic causality, causal inference cannot be performed
until the cause event whose effects are being predicted is known, but a cause event
is only observed at run time, hence run-time causal inference. The EPN encodes all
cyclic as well as non-cyclic precedence relationships from event streams on its
edges, and therefore these edges are examined by a run-time causal test (i.e., a
conditional independence test) to determine causality. Note that this run-time causal
inference overcomes the two limitations of the traditional causal model discussed
earlier (i.e., lack of support for cyclic causality and loss of many important causal
relationships). Third, we present two query processing algorithms - the Exhaustive
Search (ES) algorithm and the Reduced Search Early Termination (RSET)
algorithm - to continuously predict top-k event types with the highest scores based
on the inferred causal relationships. The ES algorithm formalizes the exhaustive
search approach. The RSET algorithm is built upon the ES algorithm, and reduces
the search space with possible early termination. As a result, it reduces the runtime
with only marginal reduction in prediction accuracy.

We conduct experiments to evaluate the performance of the proposed ES and
RSET algorithms using two real datasets. For each dataset we perform two sets of
experiments to evaluate their accuracies and runtimes, respectively. In each
evaluation, there are two objectives. The first objective is to compare the run-time causal
inference mechanism of the proposed algorithms (i.e., ES, RSET) against the
state-of-the-art traditional causal inference mechanism called the Fast Causal Network
Inference (FCNI) algorithm [1]. The FCNI algorithm is essentially inapplicable to
our problem because it cannot handle cyclic causality or run-time causal
inference, but it is the best available in the state of the art. The second objective is
to compare the query processing mechanisms between the ES algorithm and the
RSET algorithm.

The contributions of this paper are summarized as follows. 

1. It presents an event precedence model to represent the temporal precedence 
relationships between event types and proposes an algorithm to construct an 
event precedence network incrementally over event streams. 

2. It introduces a run-time causal inference mechanism to infer the causal
relationships in real time, and proposes two query processing algorithms,
Exhaustive Search and Reduced Search Early Termination, to continuously
predict the top-k next effects over event streams.

3. It empirically demonstrates the advantages of the proposed run-time causal 
inference mechanism and the query processing algorithms in terms of the 
prediction accuracy and the runtime. 

The remainder of the paper is organized as follows. Section 2 discusses the
related work, and Section 3 presents some preliminary concepts. Sections 4 and 5
describe the event precedence model and the query processing model, respectively.
Section 6 evaluates the proposed query processing algorithms. Section 7 concludes
the paper and suggests future work.

2 Related Work 

This section first discusses conventional causal inference techniques and then
describes how this paper makes unique contributions relative to other work related to
causal prediction.

There are two approaches for constructing a traditional causal network. The
first approach, search and score based (e.g., ifTOl [TTl HU HU), performs greedy
search (usually hill climbing) over all possible causal networks of the data to
select the network with the highest score. This approach, however, has two
limitations. First, the computational complexity increases exponentially as the number
of variables in the causal network increases. Second, the problem of equivalence
classes [8], where two or more network structures represent the same probability
distribution, makes the causal direction between nodes quite random and therefore
unreliable. The second approach, constraint-based (e.g., ||7] |33l HU |40l), which
performs a large number of conditional independence tests between variables to
construct a causal network, does not have the problem of equivalence classes. The
state-of-the-art Fast Causal Network Inference (FCNI) algorithm [1] presents a
traditional constraint-based causal network inference mechanism over event streams.
(Thus, we consider the FCNI algorithm as the representative of the traditional
causal network approach in this paper.) The FCNI algorithm learns temporal
precedence relationships from the event stream and performs causal inference between
only those event types which exhibit a temporal precedence relationship. Such an
approach helps to reduce the number of conditional independence tests required
for causal network inference. However, this algorithm assumes acyclic causality
in the data. (The idea of the temporal precedence-based conditional independence test
has been incorporated in the work presented in this paper.)

In addition, there has been some work (e.g., mMM) on cyclic
Bayesian networks, which aims to handle cyclic causality in Bayesian networks.
This work, however, still carries the drawbacks inherent in the Bayesian network
approach - that is, the ambiguity of equivalence classes and the inability to meet
the requirement of a causal network that the parent node in the network should
always represent the direct cause - and hence is not useful in our work.

The existing body of work on prediction only addresses inference of the
likelihood of occurrence of an effect variable given a cause variable (e.g., 1511^1^
[3QlH7l|47l), while the prediction of top-k effects requires finding the most likely
k effects among all possible effect variables. Therefore, the only way to find the
top-k next effects is to construct a traditional causal network, which ignores cyclic
causality and suffers from causal information loss, over event streams and then
infer the top-k effects of the cause exhaustively (e.g., El HSl HU). To the best of
our knowledge, there is no solution to address cyclic causality, mitigate the causal
information loss, and perform only the necessary partial search to find the top-k effects
of the given causes over event streams.

The well-established association rule mining algorithms (e.g., [20, 35, 44]) are
extensively used for prediction and recommendation. However, association does
not necessarily imply causation (e.g., S |25l |27l |29l |38l |45l). Therefore, they
are not useful to our problem due to the exclusion of the fundamental concept of
causality. That is, two variables that are associated require stronger conditions,
such as temporality and strength, to be considered causally related. A few works
on top-k query processing in the Internet domain, such as over social-tagging
networks ifT^ and over Web 2.0 streams [36], have been published. Unlike our work,
however, these works do not address causal prediction in an event-based
environment at all.


3 Preliminaries 

In this section, we introduce the concepts that are central to the techniques
explained in the paper.

3.1 Event Streams 

An event stream is a discrete, indefinitely long sequence of event instances. An
event instance (or event) refers to a timestamped action which may have an effect.
A prototype for creating events is called an event type. Each event instance is
created by one event owner. An event type can have many instances, and an event
owner can create many instances of any type. Two events are related to each other
if they share common attributes such as event owner, location, and time. These
attributes are called common relational attributes (CRAs). In Example 1 and
Example 2, the CRAs are the session id and the blackout id, respectively.

An event has the following schema: (timestamp, type, CRA, attribute-set). That
is, an event has the timestamp at which it was created, the event type it belongs to,
the CRA value, and a set of additional attributes called the attribute-set. An event
is denoted as $e_{ij}$, where i is the value of the CRA and j (= 1, 2, 3, ...) is its event
type id ($E_j$).

Example 3 Figure 1 shows an illustrative example of events in a user click event
stream of Example 1. The first field in each line (e.g., $e_{21}$) denotes the actual
event instance shown in the remainder of the line (e.g., (05/05/11 1:12 pm, 1, 2,
[200s, ...])). The session id serves as the CRA and the webpage categories (e.g.,
frontpage, news, weather, sports, entertainment, tech, local, etc.) are the event types.
For instance, in the event instance $e_{32}$, 2 is the event type and 3 is the CRA. Note
that the event type is represented by a numerical equivalent of the original event
type (e.g., frontpage = $E_1$, news = $E_2$, weather = $E_3$, sports = $E_4$, entertainment =
$E_5$, tech = $E_6$, local = $E_7$). □
$e_{21}$  (05/05/11 1:12pm, 1, 2, [200s, ...])
$e_{11}$  (05/05/11 1:13pm, 1, 1, [502s, ...])
$e_{13}$  (05/05/11 1:16pm, 3, 1, [385s, ...])
$e_{32}$  (05/05/11 1:16pm, 2, 3, [60s, ...])
$e_{43}$  (05/05/11 1:13pm, 3, 4, [97s, ...])
$e_{44}$  (05/05/11 1:13pm, 4, 4, [114s, ...])


Figure 1: Sample of event instances in a stream from Example 1.



Figure 2: Event stream. (a) Raw event stream. (b) Partitioned window
(Partitions I to VI). (The $e_{ij}$ are abbreviations of actual event instances
such as those shown in Figure 1.)


We use a window, specifically called partitioned window fH, to accumulate the
events from the stream for a user-specified observation period T. As a
preprocessing step to group related events in the window, these events are partitioned by the
CRA and then arranged in temporal order in each partition. Figure 2 shows what
a partitioned window looks like for the event stream shown in Figure 1. Once the
observation period expires, the window shifts to the next batch of events. The last
event of one window overlaps the first event of the next window in order to ensure
consistency in event precedence modeling across two consecutive windows.

Definition 1 (Partition) A partition $W_i$ in a partitioned window is defined as a set
of observed events sharing the same CRA value i and arranged in the temporal
order over a time period T, that is

$$W_i = \{\, e_{ij}(t) \mid t < T,\ i \in A,\ j \in [1, N] \,\}$$

where t is the timestamp, A is the set of all possible CRA values, j is the event
type id, and N is the number of event types. □
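The partitioning step can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions: the `Event` tuple and the `partition_window` helper are hypothetical names introduced here, timestamps are plain numbers, and the attribute-set is omitted.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    """Hypothetical event record: (timestamp, type id j, CRA value i)."""
    timestamp: float
    type_id: int
    cra: int


def partition_window(events):
    """Group the events of one window by CRA value and sort each
    partition in temporal order."""
    partitions = defaultdict(list)
    for ev in events:
        partitions[ev.cra].append(ev)
    for part in partitions.values():
        part.sort(key=lambda ev: ev.timestamp)
    return dict(partitions)


# Three events from two sessions (CRA values 2 and 3), arriving out of order.
window = [Event(3.0, 1, 2), Event(1.0, 2, 2), Event(2.0, 4, 3)]
parts = partition_window(window)
```

Each value of `parts` plays the role of one partition $W_i$; only the order of event types inside a partition matters for the precedence model.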
Symbol            Definition

$N_p$             Number of partitions
$N_{ei}$          Number of event instances in the i-th partition
$N_e$             Total number of event instances in all partitions
$E_i$             Event type with id i
$e_{ij}$          Event instance of type j and CRA i
$N$               Number of event types
$C_i$             Cause event type at position i
$T_i$             Effect event type at position i
$O$               Causal search order
$S_i$             Event type at the i-th position in $O$
$C_s$             The most recent cause event type
$E$               Set of N event types in the data $\{E_1, E_2, \ldots, E_N\}$
$R_k$             Ranked list of the top-k event types
$N_{instances}$   Number of event instances
$N_{cra}$         Number of common relational attributes
$f(E_i, E_j)$     Number of observations of the instances of type $E_i$ followed
                  by the instances of type $E_j$

Table 1: Definitions of main symbols.


The events which are being predicted are effect events while the events which
are used for prediction are cause events. We denote the cause event type and the
effect event type as $C_i \in E$ and $T_j \in E$, respectively, where i and j are the
positions of the events in each sequence. Note that $C_i$ and $E_i$ are not necessarily
the same, and nor are $T_j$ and $E_j$. Table 1 summarizes the key notations used in this
paper.


3.2 Causal Networks 

A causal network (or causal Bayesian network) is a directed acyclic graph G = (V,
H) to encode causality, where V is the set of nodes (representing event types) and H
is the set of edges between nodes. For each directed edge, the parent node denotes
the cause, and the child node denotes the effect.

The joint probability distribution of a set of N event types $E = \{E_1, \ldots, E_N\}$
in a causal network is specified as

$$P(E) = \prod_{i=1}^{N} P(E_i \mid Pa_i)$$






Figure 3: Causal network. (a) Causal network with event type names.
(b) Causal network with event type notations. (The event type names in
subfigure (a) are from the web page click stream in Figure 2, and the event
type notations in subfigure (b) are symbols used to represent the real event
type names.)


where $Pa_i$ is the set of the parent nodes of event type $E_i$.

Consider the event stream of Figure 2. The causal relationships among the
event types in the stream may be modeled as a causal network like the one shown
in Figure 3.

3.3 Conditional Independence Tests 

A popular approach for testing the conditional independence (CI) between two
random variables X and Y given a set of random variables C is conditional mutual
information (CMI) (e.g., ITHU).

$$CMI(X, Y \mid C) = \sum_{x} \sum_{y} \sum_{c} P(x, y, c) \log_2 \frac{P(x, y \mid c)}{P(x \mid c)\, P(y \mid c)}$$

where P is the probability mass function calculated from the frequencies of
variables. CMI gives the strength of dependency between variables in a measurable
quantity, which helps to identify the weak (or spurious) causal relationships.

In the traditional CMI, two variables X and Y are said to be independent if
$CMI(X, Y \mid C) = 0$, and dependent otherwise. This criterion itself offers no
distinction between weak and strong dependencies. With a higher value of $CMI(X, Y \mid C)$,
the dependency between X and Y should be considered stronger. Thus, to prune
out weak dependencies, we need a threshold CMI value, below which we consider
the evidence "too weak". To do so, we relate CMI with the $G^2$ test statistics
as below.

$$G^2(X, Y \mid C) = 2 \cdot N_s \cdot \log_e 2 \cdot CMI(X, Y \mid C)$$

where $N_s$ is the number of samples (i.e., event instances).

Under the independence assumption, $G^2$ follows the $\chi^2$ distribution with
the degree of freedom df equal to $(n_x - 1)(n_y - 1) \prod_{S \in C} n_s$, where $n_x$, $n_y$, and
$n_s$ are the number of possible distinct values of X, Y, and S, respectively. So, we
perform the test of independence between X and Y given C by using the calculated
$G^2$ test statistics as the $\chi^2$ test statistics in a $\chi^2$ test, which provides the threshold
based on df and significance level $\alpha$, to validate the result. We set $\alpha$ to the generally
accepted value of 0.05 (i.e., a 95% confidence level).

We define a Boolean function IsIndependent(X, Y, C) to test the conditional
independence between two variables X and Y given a condition set of variables C
using the $G^2$ test statistics. It returns true if these two variables are conditionally
independent; otherwise, it returns false.
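To make the test concrete, the following is a minimal Python sketch of CMI, the $G^2$ statistic, and an IsIndependent-style decision, computed from raw (x, y, c) observations. The function names, the sample layout, and the choice to pass the $\chi^2$ critical value in as a parameter are assumptions made here for illustration. The constant 3.841 is the standard $\chi^2$ critical value for df = 1 at $\alpha$ = 0.05; for other degrees of freedom the caller would look up the corresponding value.

```python
import math
from collections import Counter


def cmi(samples):
    """CMI(X, Y | C) in bits from a list of (x, y, c) observations."""
    n = len(samples)
    n_xyc = Counter(samples)
    n_xc = Counter((x, c) for x, _, c in samples)
    n_yc = Counter((y, c) for _, y, c in samples)
    n_c = Counter(c for _, _, c in samples)
    total = 0.0
    for (x, y, c), k in n_xyc.items():
        # P(x,y|c) / (P(x|c) P(y|c)) expressed with raw counts.
        ratio = k * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)])
        total += (k / n) * math.log2(ratio)
    return total


def g2(samples):
    """G^2 = 2 * N_s * ln(2) * CMI(X, Y | C)."""
    return 2.0 * len(samples) * math.log(2) * cmi(samples)


def is_independent(samples, critical_value):
    """True when G^2 stays below the chi-square threshold for the
    caller's df and significance level."""
    return g2(samples) < critical_value


CHI2_CRIT_DF1 = 3.841  # chi-square critical value, df = 1, alpha = 0.05

# Binary X and Y with a single condition value, so df = (2-1)(2-1)*1 = 1.
dependent = [(0, 0, 0)] * 50 + [(1, 1, 0)] * 50
independent = [(x, y, 0) for x in (0, 1) for y in (0, 1)] * 25
```

With these toy samples, `dependent` carries one full bit of conditional mutual information, while `independent` carries none, so only the latter passes the independence test.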

The unbounded and continuous nature of event streams of interest makes it
infeasible to store all of the historical data. Therefore, we use an incremental
approach such that when a new batch of events is processed, we only update the
record of the frequency of observations without storing the old events.

4 Event Precedence Model 

In this section, we introduce the proposed incremental mechanism to model the 
precedence relationships between events in a network structure. 

4.1 Model 

To overcome the limitations of the existing causal models described in Section 1,
we propose the event precedence model (EPM). It represents the temporal
precedence relationships in a first order absorbing Markov chain, called event
precedence network (EPN), over which further analysis is done to predict the probable
effect events based on the observed cause events. Note that the temporal
precedence is a required criterion of a causal relationship. To avoid any information
loss, evidence of the precedence between every two events in the stream is
preserved. EPM takes the partitioned window (collected from the event stream) as an
input and incrementally builds a model to reflect all precedence relationships in the
input data. The actual data is discarded once a new batch of events arrives. Such
an adaptive approach is essential for a streaming environment with continuous and
unbounded data.

We make the following assumptions in the event precedence model.

• Given an ordered sequence of cause event types $\{C_0, C_1, \ldots, C_s\}$, an
instance of the effect event type $T_0$ cannot occur without the occurrence of
an instance of the most recent cause event type $C_s$. Moreover, while all
past events influence the future events, the strongest influence is exerted by
the immediately preceding event of the effect event. With this in mind, the
precedence relationships only between every two consecutive events are
considered.

• There cannot be a causal relationship between events of the same type and, 
therefore, such precedence relationships are ignored. 

• The cause and effect events should share the same CRA value. As described
in Section 3.1, the events are grouped into partitions based on their CRA
values (e.g., session id and blackout id in Examples 1 and 2, respectively).
In other words, two events are not related to each other if they have different
CRA values.

The proposed EPM is a first order absorbing Markov chain [24], where an
observation is independent of all previous observations except the most recent one
and every state can reach an absorbing (a.k.a. terminating) state. Thus, the
probability of occurrence of an effect event given past cause events is given as follows.

$$P(T_0 \mid C_0, C_1, \ldots, C_s) = P(T_0 \mid C_s).$$

$P(T_0 \mid C_s)$ can be rewritten as below.

$$P(T_0 \mid C_s) = \frac{P(T_0, C_s)}{P(C_s)}$$

which, then, can be estimated as

$$P(T_0 \mid C_s) = \frac{f(C_s, T_0)}{\sum_{E_j \in children(C_s)} f(C_s, E_j)} \qquad (1)$$

where $f(E_i, E_j)$ denotes the number of observations in which instances of the type
$E_i$ precede instances of the type $E_j$.

In summary, EPM allows us to automatically build a tractable probabilistic
graphical model from the events, discovering the existing dependencies among the
event types in the event stream. These dependencies are represented by a graph,
as illustrated in Figure 4, where the conditional probabilities are stored at the node
level.

Figure 4: Illustration of event precedence network construction from the event
stream in Figure 2.

4.2 Algorithm 

Algorithm 1 outlines the event precedence network construction algorithm. It has
three steps: observation, graph generation, and evidence inscription. These steps 
are discussed below. 

1. Observation: This step observes adjacent neighbor events in each partition
of the window to learn the precedence relationships and update the frequency
matrix f. Note that, based on the assumptions stated earlier, the precedence
relationships should be between events in the same partition and between
events of different types. Suppose $E_i$ and $E_j$ are the event types of two
adjacent events. Then, their count $f(E_i, E_j)$ is increased by 1. Additionally, note
that the frequency matrix f is updated incrementally for each new partition
of events.

2. Graph Generation: This step starts with an edgeless graph G = (V, H),
where V is the set of nodes (event types) and H is an empty set of edges.
Then, for any evidence of the precedence relationship between event types
$E_i$ and $E_j$ (i.e., $f(E_i, E_j) > 0$), an edge is added between the two nodes
representing these event types. Note that the graph supports anti-parallel
edges between nodes; in addition, a cyclic loop of edges is also supported.
Thus, the graphical model offers the flexibility to incorporate all possible
types of relationships, unlike in the traditional systems where only directed
edges are supported.
In addition, for every edge added in the graph, the probability of an event
type given its parent event type is calculated using Equation 1. The
calculated probabilities are then stored in the nodes.

Algorithm 1 Event Precedence Model

Require: A partitioned window P for a batch of new events.

Observation:
    for each partition $W_k \in P$, where k is the CRA value, do
        for each pair of consecutive events (of type $E_i$ and $E_j$ such that $i \neq j$) in $W_k$ do
            $f(E_i, E_j)$++, i.e., increase the observed frequency by 1;
        end for
    end for

Graph Generation:
    Construct an edgeless network G = (V, H);
    for each pair of event types $E_i$ and $E_j$ such that $i \neq j$ do
        if $f(E_i, E_j) > 0$ then
            $H \leftarrow H \cup \{E_i \to E_j\}$, i.e., add an edge $E_i \to E_j$;
            $P(E_j \mid E_i) \leftarrow f(E_i, E_j) / \sum_{E_k \in children(E_i)} f(E_i, E_k)$;
        else
            $P(E_j \mid E_i) \leftarrow 0$;
        end if
        if $f(E_j, E_i) > 0$ then
            $H \leftarrow H \cup \{E_j \to E_i\}$, i.e., add an edge $E_j \to E_i$;
            $P(E_i \mid E_j) \leftarrow f(E_j, E_i) / \sum_{E_k \in children(E_j)} f(E_j, E_k)$;
        else
            $P(E_i \mid E_j) \leftarrow 0$;
        end if
    end for

The running time complexity of the algorithm is polynomial in the total
number of events that have arrived thus far and the number of event types. The
observation step counts every pair of consecutive events in every partition of the
window. Clearly, for each partition, the number of the counts is always one less
than the number of events in it. If $N_e$ and $N_p$ are the number of events and the
number of partitions, respectively, then the running time complexity is given as
$O(\sum_{i=1}^{N_p} (N_{ei} - 1)) \approx O(N_e)$, where $N_{ei}$ is the number of events in the i-th
partition. The graph generation phase checks for the evidence of the precedence
relationships between every pair of event types. In the worst case, the event
precedence network is completely connected (including cyclic edges and self referencing
edges) and has $O(N^2)$ edges. So, the running time complexity of this step is
proportional to $N(N - 1)$ or $O(N^2)$. Hence, the total running time is given as
$O(N_e) + O(N^2) = O(N_e + N^2)$.
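The observation and graph generation steps can be sketched as follows. This is an illustrative Python sketch rather than the authors' implementation: the function names and the dictionary-based representation (a frequency map keyed by type pairs, and an adjacency map holding P(child | parent)) are assumptions introduced here, and each partition is reduced to its temporally ordered list of event type ids.

```python
from collections import defaultdict


def update_frequencies(freq, partitions):
    """Observation step: for each partition, count every pair of
    consecutive events whose types differ. `freq` persists across
    windows, so the raw events can be discarded afterwards."""
    for types in partitions.values():
        for a, b in zip(types, types[1:]):
            if a != b:  # same-type precedence is ignored
                freq[(a, b)] += 1
    return freq


def build_epn(freq):
    """Graph generation: add an edge a -> b whenever f(a, b) > 0 and
    store P(b | a) = f(a, b) / sum of f(a, child) over children of a."""
    out_total = defaultdict(int)
    for (a, _), n in freq.items():
        out_total[a] += n
    epn = defaultdict(dict)
    for (a, b), n in freq.items():
        if n > 0:
            epn[a][b] = n / out_total[a]
    return dict(epn)


freq = defaultdict(int)
# Two partitions keyed by CRA value, each a sequence of event type ids.
update_frequencies(freq, {1: [1, 2, 1, 3], 2: [2, 2, 1]})
epn = build_epn(freq)
```

In this toy input, types 1 and 2 precede each other in turn, so the resulting network contains the anti-parallel edges 1 → 2 and 2 → 1, which a directed acyclic causal network could not represent.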

5 Top-K Predictive Query Processing 

In this section, we first describe the predictive query processing model and its 
run-time causality test and then present the two top-k continuous predictive query 
processing algorithms - Exhaustive Search (ES) algorithm and the more efficient 
Reduced Search Early Termination (RSET) algorithm. 

5.1 Predictive Query Processing Model 

The predictive query processing problem can be formulated as a search problem
to find the possible effects of a given set of observed events in a causal network.
However, the traditional causal network cannot be used for query processing due to
the causal information loss and its lack of support for cyclic causality. To address
this issue, since we already know that every causal relationship is always a
temporal precedence relationship, we propose to infer causality during query processing
from the event precedence network to determine the possible effects.

In our work, the predictive query is a standing continuous query, so the ranked
result list may change every time a new event is observed. The idea is to explore the
event precedence network (EPN), which represents all the precedence relationships
(including cyclic precedence relationships) among the event types, to answer the
predictive queries when evidence is available. Indeed, the effect events are always
the descendants of the cause events. Therefore, an outward breadth first search on
the EPN is required to find the effect events. In situations where a visited node
is encountered again, as the EPN is cyclic, we ignore it. As discussed in Section 4.1,
the next effect events cannot occur without the existence of the most recent cause
event. Therefore, the starting point for exploring the EPN is always the event type
$C_s$ of the most recent cause event. We call this event type the effect observation
point (EOP). For instance, in Figure 4, consider the two event types $E_3$ and $E_4$. $E_3$
is the effect of $E_4$ when $E_4$ is the EOP, whereas $E_4$ is the effect of $E_3$ when $E_3$ is
the EOP, as illustrated in Figure 5.

Figure 5: Views from the EOPs for Figure 4.

However, there are two issues which make the EPN not directly usable to answer
a predictive query. First, a precedence relationship is not necessarily a causal
relationship. So, we have to remove from the precedence relationships those that
are not causal. Second, two variables may have a causal relationship in the absence
of other variables, but may not exhibit causality in the presence of certain
condition variables. For example, rain and wet ground are dependent variables, as rain
causes wet ground. However, they are independent in the presence of a roof over
the ground (which is a conditional variable), as rain does not cause wet ground,
given the existence of a roof. Therefore, we test causal relationships to resolve
these issues during query processing for finding possible effects. To determine
causality, the conditional independence tests, as described in Section 3.3, are
performed between two event types with an edge in the EPN.

The ranking score of the predicted effect event type $E_i$ is calculated as
$P(E_i \mid C_\delta)$ given its EOP $C_\delta$. An EPN node stores the conditional
probability of every child node given the current node as the parent node. The score
across a chain of event types $E_g \rightarrow E_p \rightarrow E_i$ (where $E_p$ is a
parent of $E_i$ and $E_g$ is a parent of $E_p$) in the EPN is calculated using the
multiplicative property of conditional probability:

$$P(E_i \mid E_g) = P(E_i \mid E_p) \cdot P(E_p \mid E_g).$$
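As a minimal sketch of this chain rule (not the authors' implementation), the score along a chain can be computed by multiplying the stored parent-to-child conditional probabilities. The function name `chain_score`, the dictionary layout, and the probability values are assumptions made for illustration only.

```python
from math import prod

def chain_score(cond_prob, chain):
    """Multiply P(child | parent) along consecutive pairs of a chain."""
    return prod(cond_prob[(parent, child)]
                for parent, child in zip(chain, chain[1:]))

# Hypothetical edge probabilities, keyed as (parent, child).
cond_prob = {("E3", "E5"): 0.167, ("E5", "E6"): 0.5}

# P(E6 | E3) = P(E6 | E5) * P(E5 | E3)
score = chain_score(cond_prob, ["E3", "E5", "E6"])
```

The lookup table mirrors the idea that each EPN node stores the conditional probability of each of its children.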


5.2 Exhaustive Search Algorithm 

5.2.1 Approach 

The most straightforward solution to the top-k prediction problem is to search for 
all possible effects exhaustively during the run-time causal inference over EPN and 
then sort them, according to their scores, in non-increasing order to determine the 
k effects with the top scores. We call this solution the Exhaustive Search (ES) 
approach. 

The ES approach should have a robust strategy for exploring the EPN to infer
effects. As discussed earlier, an outward breadth-first search may be run over the
EPN for run-time causal inference. However, the score calculation of the effects
is not straightforward. To apply the multiplicative property of conditional
probability described in Section 5.1, the scores of the parents of an event type
should be known before its score is calculated, which is not always possible, as
demonstrated in Figure 6. Therefore, we define a search order, called the causal
search order, before exploring the EPN for run-time causal inference.

(Score of Ej is P(Ej|Ei) + P(Ej|Ei) · P(Ei|Ei). Note that Ei is
explored only after Ej. Therefore, P(Ei|Ei) is unknown during the
calculation of Ej's score.)

Figure 6: Illustration of the need for a causal search order.

Definition 2 (Causal Search Order). The causal search order $O$ is an ordered set
of event types $\{S_1, S_2, \ldots, S_N\}$ observed during the outward breadth-first
search of the EPN such that $S_{i+j}$ is never an ancestor of $S_i$, where $j > 0$ and
$i + j \le N$.

□

In addition to guiding the search for possible effects, the causal search order 
provides us with an effective strategy to calculate the ranking score. It gives an 
order of the event types such that the probabilities of parents are always known 
before calculating the probabilities of their children. 
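The construction of the causal search order can be sketched as a plain breadth-first traversal that skips already-visited nodes. This is an illustrative sketch, not the paper's code: the adjacency map `children` is a hypothetical EPN (including one cyclic edge from E4 back to E3), and the function name is an assumption.

```python
from collections import deque

def causal_search_order(children, eop):
    """Return event types in the order an outward BFS from the EOP reaches them."""
    order = [eop]
    seen = {eop}
    queue = deque([eop])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child in seen:
                continue  # the EPN may be cyclic; revisited nodes are ignored
            seen.add(child)
            order.append(child)
            queue.append(child)
    return order

# Hypothetical EPN: E3 precedes E1, E4, E5; E5 precedes E6, E7; E4 precedes E3.
children = {"E3": ["E1", "E4", "E5"], "E5": ["E6", "E7"], "E4": ["E3"]}
order = causal_search_order(children, "E3")
```

Because a node is appended only when it is first reached, every node reached through the search appears after the node that discovered it, which is the property the score calculation relies on.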

Example 4 Let us illustrate the causal search order considering the example EPN.
As described earlier, we run an outward breadth-first search in the EPN from
the EOP. Suppose $E_3$ is the EOP. Initially, $E_3$ is added to $O$ and is explored.
Then, the children of $E_3$ are added, so $O$ becomes $\{E_3, E_1, E_4, E_5\}$. Then,
since $E_3$ has already been explored, the next unexplored node in $O$, $E_1$, is
explored. However, no new nodes are added to $O$ as $E_1$ has no child. Then, we
consider the next unexplored node ($E_4$) in $O$. Similar to $E_1$, no new nodes are
added to $O$ as $E_4$ has no child. Now, the only remaining unexplored node in $O$ is
$E_5$. So, the children of $E_5$ are added to $O$, which then becomes
$\{E_3, E_1, E_4, E_5, E_6, E_7\}$. The recently added unexplored nodes $E_6$ and
$E_7$ do not have any children and, therefore, no new nodes are added to $O$. So, the
final causal search order $O$ is $\{E_3, E_1, E_4, E_5, E_6, E_7\}$. These steps are
shown in Figure 7(a). Similarly, when the EOP is $E_6$, the causal search order $O$
may be determined to be $\{E_6, E_7, E_5, E_3, E_1, E_4\}$, as shown in Figure 7(b). □

Figure 7: Illustration of the steps to determine the causal search order for the
example event precedence network: (a) $E_3$ as EOP; (b) $E_6$ as EOP.


5.2.2 Algorithm 

Algorithm 2 outlines the ES algorithm. It has a two-pass strategy for exploring
the EPN to infer effects. In the first pass, a breadth-first search is run over the
EPN to determine the causal search order. In the second pass, the EPN is explored in
this search order for run-time causal inference. The input to the algorithm is the
size of the result $k$, the event precedence network $G$, and the set of the recently
observed $\delta$ cause event types arranged in temporal order,
$\{C_1, C_2, \ldots, C_\delta\}$. The four main steps of the algorithm are given as
follows.

1. First, an outward breadth-first search over $G$ from the EOP (the most recent
cause event type $C_\delta$) is run to determine the causal search order $O$. Line 1
shows this step.

2. Second, the marginal independence tests are performed between the event
types during the search to remove any weak relationships (lines 2-7). (A
marginal independence test disregards the effect of other event types; in other
words, it is equivalent to a conditional independence test with an empty
condition set.)

3. Third, $G$ is searched to find the effects of every unexplored node $E_j$ based
on the ordering of the event types in $O$. Lines 8-19 show this step. The CI
tests between $E_j$ and each of its parents are performed, as only the parents
of $E_j$ can have an effect on it. These tests are required to make sure that the
event types being considered are not independent in the presence of other
event types. In case of independence between $E_j$ and its parent, the edge
representing their precedence relationship in $G$ is removed. Lines 9-16
show this step.

Then, the score of the node $E_j$ is calculated and stored, as shown in lines 17
and 18.

4. Finally, the ranking scores of all explored event types are sorted in
non-increasing order, and then the $k$ event types with the top scores are selected
(line 20).


Algorithm 2 ES Algorithm

Require: A temporally ordered set of recently observed $\delta$ cause event types
$C = \{C_1, C_2, \ldots, C_\delta\}$, the size of the result $k$, an empty buffer $B_t$ to
store the effect event types and their scores, and the event precedence network
$G = (V, \mathcal{E})$.

 1: Determine the causal search order, $O$, with the outward breadth-first search in $G$
    from the EOP ($C_\delta$).
 2: for every edge $E_i \rightarrow E_j \in \mathcal{E}$ do
 3:   isIndependent ← IsIndependent($E_i$, $E_j$, $\emptyset$);
 4:   if isIndependent is true then
 5:     $\mathcal{E} \leftarrow \mathcal{E} - \{E_i \rightarrow E_j\}$; (// remove the weak relationship.)
 6:   end if
 7: end for
 8: for every node $E_j \in (O - \{C_\delta\})$ do
 9:   $S_{parents}$ ← parents of $E_j$;
10:   for every parent node $E_i \in S_{parents}$ do
11:     isIndependent ← IsIndependent($E_i$, $E_j$, $S_{parents} - \{E_i\}$);
12:     if isIndependent is true then
13:       $\mathcal{E} \leftarrow \mathcal{E} - \{E_i \rightarrow E_j\}$;
14:       $S_{parents} \leftarrow S_{parents} - \{E_i\}$;
15:     end if
16:   end for
17:   $P(E_j \mid C) \leftarrow P(E_j \mid E_p) P(E_p \mid C)$;
18:   Insert the pair $(E_j, P(E_j \mid C))$ into $B_t$;
19: end for
20: Sort the nodes in $B_t$ in the non-increasing order of the score and return the
    top-k results from $B_t$.

Example 5 Let us illustrate the ES algorithm considering the example EPN.
Suppose $C$ is $\{E_2, E_3\}$ and $k$ is 2.

1. First, the causal search order $O$, from the EOP (i.e., $E_3$), is determined as
$\{E_3, E_1, E_4, E_5, E_6, E_7\}$.

2. Then, the marginal independence tests are performed on each edge in the EPN.
For simplicity in illustration, we assume that these tests fail to remove any
edges.

3. Now, the score of each event type in $O$ is calculated and stored into the buffer
$B_t$ based on their ordering in $O$.

(a) The score of the first unexplored event type $E_1$ is calculated and updated
in $B_t$ as follows.

• Determine the parents of $E_1$: $S_{parents} = \{E_3\}$.

• Perform CI tests between every parent of $E_1$ (i.e., $S_{parents}$) and
$E_1$. Suppose the CI test succeeds (that is, dependence is confirmed), and
thus no edge is removed.

• Calculate the score of $E_1$: $P(E_1 \mid C) = P(E_1 \mid E_3) \cdot P(E_3 \mid C) = 0.333$.

• Update $B_t$ as $\{(E_1, 0.333)\}$.


(b) The score of the next unexplored event type $E_4$ is calculated similarly
as above, and $B_t$ is updated to $\{(E_1, 0.333), (E_4, 0.50)\}$.

(c) Following the same step as above, $B_t$ is updated to $\{(E_1, 0.333),
(E_4, 0.50), (E_5, 0.167)\}$ for the next event type $E_5$.

(d) Next, $B_t$ is updated to $\{(E_1, 0.333), (E_4, 0.50), (E_5, 0.167), (E_6, 0.0835)\}$
for the event type $E_6$.

(e) $B_t$ is updated to $\{(E_1, 0.333), (E_4, 0.50), (E_5, 0.167), (E_6, 0.0835),
(E_7, 0)\}$ for the event type $E_7$. Note that the score $P(E_7 \mid C)$ of $E_7$ is
assigned as zero. For the parent event type $E_5$, the probability of its
child $E_7$ is much lower (half) than the probability of its other child $E_6$.
Thus, for this illustration, we assume the CI test between $E_7$ and $E_5$
fails (that is, the two event types are found to be independent). Consequently,
there is no edge between them.

4. Finally, $B_t$ is sorted in non-increasing order of the score as $\{(E_4, 0.50),
(E_1, 0.333), (E_5, 0.167), (E_6, 0.0835), (E_7, 0)\}$. Then, the top-2 predicted next
event types are selected from $B_t$ as $\{(E_4, 0.50), (E_1, 0.333)\}$.

□
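The ES procedure can be sketched in a few lines, offered here as a rough illustration under stated assumptions rather than as the authors' implementation. The independence test is passed in as a caller-supplied predicate `is_independent(parent, child, condition_set)` (hypothetical; the paper's CI test is not reproduced), the edge probabilities are illustrative values, and summing over surviving parents that already have a score is an assumption about how multiple parents are combined.

```python
from collections import deque

def causal_search_order(children, eop):
    """Breadth-first order of event types reachable from the EOP."""
    order, seen, queue = [eop], {eop}, deque([eop])
    while queue:
        node = queue.popleft()
        for child in children.get(node, {}):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

def exhaustive_search(children, eop, k, is_independent):
    """children maps parent -> {child: P(child | parent)}."""
    order = causal_search_order(children, eop)
    # Marginal tests (empty condition set) drop weak edges.
    edges = {
        (parent, child): prob
        for parent, kids in children.items()
        for child, prob in kids.items()
        if not is_independent(parent, child, frozenset())
    }
    scores = {eop: 1.0}  # the EOP has already been observed
    for node in order[1:]:
        parents = [p for (p, c) in edges if c == node]
        for parent in list(parents):
            others = frozenset(parents) - {parent}
            if is_independent(parent, node, others):
                del edges[(parent, node)]
                parents.remove(parent)
        # Assumption: sum over surviving parents whose score is already known.
        scores[node] = sum(edges[(p, node)] * scores[p]
                           for p in parents if p in scores)
    ranked = sorted(((n, s) for n, s in scores.items() if n != eop),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical EPN and a stand-in test that only rejects the edge E5 -> E7.
children = {"E3": {"E1": 0.333, "E4": 0.5, "E5": 0.167},
            "E5": {"E6": 0.5, "E7": 0.25}}

def is_independent(parent, child, condition_set):
    return (parent, child) == ("E5", "E7")

top2 = exhaustive_search(children, "E3", 2, is_independent)
```

Every reachable node is scored before the final sort, which is what makes the search exhaustive.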

5.3 Reduced Search Early Termination Algorithm 

5.3.1 Approach 

The ES algorithm searches exhaustively in the EPN during query processing to 
determine the top-k results and, therefore, may well be slow. More importantly, it 
scales very poorly and, therefore, this approach is likely to be intractable for a large 
network. Naturally there is a need for an alternative method that is faster. There 
are two issues to deal with for that purpose. 

1. Running time: With the exhaustive search of ES on the EPN to find all possible
results, the search space increases with the number of variables and it performs
unnecessary computations. A good query processing method should
avoid redundant and unnecessary computations to reduce the running time.

2. Accuracy: Usually, the tradeoff for reducing running time is the loss in
the accuracy of the results. The running time is decreased by reducing the
search space, which may skip the correct effect events. The query processing
method should avoid such omissions as much as possible.

To achieve faster running time while predicting accurate and consistent results,
we propose a new algorithm called the Reduced Search Early Termination (RSET)
algorithm. It is built upon the ES algorithm. The following strategies reduce the
runtime with only a marginal reduction in prediction accuracy.

1. To reduce the running time. There are two ideas to reduce the running time.
First, we reduce the search space in the EPN by exploring only the descendants
of the nodes in the current top-k during query execution. The nodes not in the
top-k and their descendants have lower scores, due to the multiplicative property
of conditional probability described in Section 5.1, and are thereby disqualified
from being top-k candidates. Second, we use a priority-based breadth-first search
with an early termination criterion such that the query execution is stopped
as soon as it is certain that the top-k results have been found. The priority-based
breadth-first search always chooses the unexplored descendant node
with the highest score to explore its children. The early termination criterion
is met if there is no change in the list of event types in the top-k; that means
there is no more descendant node whose score can be greater than those in
the current top-k. Consequently, the search space is only partially explored.
For this reason, even though the causal inference is done at runtime, it incurs
only a small overhead.

2. To predict accurate results. In an effort to achieve the same level of accuracy
as the exhaustive search, we employ two ideas. First, we calculate the ranking
score with the evidence from all explored nodes. Second, we perform
the breadth-first search of the EPN in such a way that the events with greater
scores are processed earlier. It is worth noting that the ancestors always have
a higher score than their descendants or, in the worst case, the same score.

5.3.2 Algorithm 

Algorithm 3 outlines the RSET algorithm and can be described as follows.

1. First, two empty buffers, $B_e$ and $B_k$, are created to store the event types
explored during query processing and the top-k effect event types computed,
respectively. $B_k$ can hold at most $k$ event types. Line 1 states this step.

2. Second, we add the EOP, $C_\delta$, with 1 as its score to both these buffers.
$C_\delta$ has the probability of 1 since the event type has already been observed.
Line 2 states this step.

3. Then, the algorithm employs the following strategies to (a) reduce the search
space, (b) predict accurate results, and (c) terminate as early as possible.

• To reduce the search space. Two ideas are employed to reduce the
search space. First, the algorithm explores only the children of the EOP or of
the event types in the buffer $B_k$ for further computation. Line 3 reflects this
strategy. For any event type $E_c$ not in $B_k$ (i.e., the top-k), $E_c$ and its
descendants are ignored, thus reducing the search space. The probabilities of
the children of $E_c$ are lower than that of $E_c$ due to the multiplicative
property of conditional probability, or equal to it in the (rare) best
case. Second, the buffer $B_k$ is always sorted in non-increasing order of
the score after a new event type is added to it (lines 18-19 and 23-24).
So, the priority for network exploration is always given to the unvisited
node with the highest score. This means that we consider the fact that
its children might have higher ranking scores than the unvisited nodes
already in $B_k$. Consequently, the node with the lowest probability
($E_{lowest}$) in $B_k$ can be removed if $B_k$ is full, which further reduces the
search space. Lines 3 and 23 reflect this mechanism.

• To predict accurate results. The algorithm keeps in $B_e$ the record of all event
types visited. Basically, the event types in the EPN are explored in
breadth-first order. Therefore, the parents of a node have already been
explored by the time the algorithm considers the child node. While
doing so, the causal relationships are tested in lines 7-14. Although
the first strategy above reduces the search space, the algorithm still
keeps all ancestors in $B_e$ by recording all visited nodes so far. Due to
this search strategy, only the nodes with lower scores are not visited.
So, the ranking score calculation is not affected by the reduced search
space. Consequently, the accuracy of the prediction result is not degraded.
Lines 7, 15, and 16 make use of this strategy.

• To terminate as early as possible. An early termination is possible if
there is no change in the list of event types in $B_k$ after exploring their
children. It means that there are no more event types further down the
current level of exploration that can have a higher probability than those
already in $B_k$. Checking only unvisited nodes in line 3 reflects this
strategy.

The computational complexities of the RSET and ES algorithms are dominated
by the number of conditional independence tests, which is exponential in the worst
case as shown in our prior work on the ECNI algorithm [1], and thus both the
RSET and ES algorithms have exponential computational complexity in the worst
case. In practice, however, the RSET algorithm achieves a significant reduction in
the computational cost due to pruning by the reduced search space and early
termination.

Example 6 Let us illustrate the RSET algorithm considering the example EPN.
Suppose $C$ is $\{E_2, E_3\}$ and $k$ is 2.

1. First, two empty buffers $B_e$ and $B_k$ are created to store all the event types
explored so far and to store the current top-k predicted event types, respectively.

2. Then, the search starts with the EOP ($E_3$) and updates $B_e$ to $\{(E_3, 1)\}$.

3. Now, every unvisited event type is explored as follows. For simplicity in
illustration, we assume that the CI tests return false and hence the edges are
not removed.

(a) The first unvisited event type, $E_3$, is marked as visited and its children,
$S_{children}$ (= $\{E_1, E_4, E_5\}$), are explored.

i. The score of the first unexplored child, $E_1$, is calculated and added
to the two buffers as follows.

• Determine the parents of $E_1$; $S_{parents}$ is set to $\{E_3\}$.

• Perform the CI test of the edge between $E_1$ and $E_3$ given $S_{parents}$.

• Calculate the score of $E_1$ as $P(E_1 \mid C) = P(E_1 \mid E_3) P(E_3 \mid C) = 0.333$.

• Update $B_e$ to $\{(E_3, 1), (E_1, 0.333)\}$.

• Update and sort $B_k$ to $\{(E_1, 0.333)\}$.

ii. The same steps as above are followed for the next unexplored child
$E_4$. The two buffers $B_e$ and $B_k$ are updated to $\{(E_3, 1), (E_1, 0.333),
(E_4, 0.50)\}$ and $\{(E_4, 0.50), (E_1, 0.333)\}$, respectively.

iii. $B_e$ and $B_k$ are updated to $\{(E_3, 1), (E_1, 0.333), (E_4, 0.50),
(E_5, 0.167)\}$ and $\{(E_4, 0.50), (E_1, 0.333)\}$, respectively, for the next
unexplored child $E_5$.

(b) Now, the next unvisited event type, $E_4$ in $B_k$, is marked as visited
and its children are explored. However, $E_4$ has no child, i.e.,
$S_{children}$ is empty. Therefore, there is no computation to be performed.

(c) The same result is seen for the next event type, $E_1$, as well. As it has no
child (i.e., $S_{children}$ is empty), there is no computation to be performed.

4. The top-k result in $B_k$ is obtained as $\{(E_4, 0.50), (E_1, 0.333)\}$.

For the same $C$, the RSET algorithm considered only four event types ($E_3$, $E_1$,
$E_4$, $E_5$), whereas the ES algorithm considered all event types ($E_3$, $E_1$, $E_4$,
$E_5$, $E_6$, $E_7$). Note that in this example, RSET produced the same result (i.e.,
$B_k = \{(E_4, 0.50), (E_1, 0.333)\}$) as ES (which is typical unless the value of $k$ is
significantly large).

This shows the merit of the early-terminating reduced search approach of the RSET
algorithm against the exhaustive search approach of the ES algorithm. □
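The RSET walk-through can be approximated by the sketch below. It follows the prose description and Example 6 rather than the Algorithm 3 listing, so details may differ from the authors' code. Assumptions: the independence test is a caller-supplied predicate (hypothetical), the edge probabilities are illustrative, scores are summed over surviving explored parents, and the EOP is kept in the explored buffer only, as in the example.

```python
def rset(children, eop, k, is_independent):
    """children maps parent -> {child: P(child | parent)}.
    Returns (top_k, explored), corresponding to B_k and B_e in the text."""
    explored = {eop: 1.0}   # B_e: every scored event type so far
    top_k = []              # B_k: at most k (event type, score) pairs
    visited = set()
    current = eop
    while current is not None:
        visited.add(current)
        for child in children.get(current, {}):
            if child in explored:
                continue
            parents = [p for p in explored if child in children.get(p, {})]
            kept = [p for p in parents
                    if not is_independent(p, child, frozenset(parents) - {p})]
            score = sum(children[p][child] * explored[p] for p in kept)
            explored[child] = score
            top_k.append((child, score))
            top_k.sort(key=lambda pair: pair[1], reverse=True)
            del top_k[k:]   # drop the lowest-scored entries when B_k overflows
        # Priority: the unvisited node in B_k with the highest score.
        current = next((n for n, _ in top_k if n not in visited), None)
    return top_k, explored

# Hypothetical EPN; the stand-in test never declares independence.
children = {"E3": {"E1": 0.333, "E4": 0.5, "E5": 0.167},
            "E5": {"E6": 0.5, "E7": 0.25}}
top_k, explored = rset(children, "E3", 2, lambda p, c, cond: False)
```

The loop ends once no unvisited node remains in the top-k buffer, which plays the role of the early termination criterion; in this toy network, E5 is scored but never expanded, so E6 and E7 are never reached.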

6 Performance Evaluation 

We conduct experiments to evaluate the run-time causal inference model and the
top-k query processing mechanism in the proposed RSET algorithm and the ES
algorithm. One evaluation is with respect to the accuracy of the top-k results,
and the other evaluation is with respect to the runtime. Section 6.1 describes the
experiment setup, including the evaluation metrics, datasets and the platform used,
and Section 6.2 presents the experiment results.

6.1 Experiment Setup

6.1.1 Evaluation Metrics 

Intuitively, the performance of the top-k query processing algorithms is best
evaluated by examining two important evaluation criteria: accuracy and runtime.


1. Accuracy. Suppose $R_k$ is the ranked list of the top-k effects predicted
from the set of cause event types, $C$, observed so far in the test sequence.
To measure accuracy, the next event type observed in the test sequence, $E_o$
($o \in [1, N]$), is checked against the predicted ranked list $R_k$. If $E_o$ exists in
$R_k$, we say the prediction is correct (hit); otherwise we say the prediction is
incorrect (miss).

There are two methods for deciding the correctness of a top-k predicted
result. First, in some scenarios, a hit may be a sufficient condition for
correctness. In such a case, we say the accuracy is 100% for a hit and 0% for a
miss. We call this perspective hit-or-miss (non-weighted). Second, in other
scenarios, the rank of the predicted result may also play an important role in
determining the accuracy. Clearly, if the algorithm predicts $E_o$ as the most
likely effect (i.e., at the top of $R_k$ with the highest probability), then the
accuracy is 100%. As we go down the list in $R_k$, the accuracy decreases.
Therefore, to reflect this point, we take the probability of each event in $R_k$
into consideration when calculating the prediction accuracy. We call this
perspective weighted.

Hit-or-miss accuracy. Let $n_{hits}$ and $n_{misses}$ be the number of hits and
the number of misses, respectively, out of $n_{tests}$ tests. Then, the hit-or-miss
accuracy of the result, $\alpha_{hm}$, is calculated as follows.

$$\alpha_{hm} = \frac{n_{hits}}{n_{tests}} = \frac{n_{hits}}{n_{hits} + n_{misses}} \qquad (2)$$


Weighted accuracy. Suppose $P(E_o)$ is the score of $E_o$ in $R_k$. As discussed
earlier, the rank of $E_o$ in $R_k$ contributes towards the calculation of the
prediction accuracy. The rank is based on the score; therefore, we normalize
the score such that the prediction accuracy decreases gradually with the
decrease in the rank of $E_o$ in $R_k$. The accuracy of $E_o$ is 0% in the case of a
miss, whereas in the case of a hit it is

$$\frac{P(E_o)}{\max\{P(E_r) \mid E_r \in R_k\}},$$

where the denominator is the highest probability among all event types in
$R_k$. (Note that this measure gives the top event type the accuracy of 100%.)
Let $n_{hits}$ and $n_{misses}$ be the same as above. Then, the weighted accuracy of
the result, $\alpha_{weighted}$, is computed as follows.

$$\alpha_{weighted} = \frac{\sum_{i=1}^{n_{tests}} \frac{P(E_{o_i})}{\max\{P(E_r) \mid E_r \in R_k\}}}{n_{tests}}$$

where $E_{o_i}$ is the $i$-th observed event type in the test data.

As mentioned earlier, the accuracy of a miss is 0%. Therefore, we can consider
only the hits in the numerator.

$$\alpha_{weighted} = \frac{\sum_{h=1}^{n_{hits}} \frac{P(E_{o_h})}{\max\{P(E_r) \mid E_r \in R_k\}}}{n_{hits} + n_{misses}}$$

where $E_{o_h}$ is the $h$-th observed event type which has the result hit in the test
data.

2. Runtime. The runtime is the CPU time taken during query processing.
Note that the event precedence network construction is not part of the query
processing mechanism. Therefore, we do not include it in the measurement
of the runtime. In the query processing with the RSET algorithm and the
ES algorithm, there is an overhead of run-time causal inference and it is
included in the runtime. In contrast, the query processing with the traditional
causal inference (i.e., the ECNI algorithm) does not include the causal network
construction time in the runtime, as the causal inference is performed only
once (during the causal network construction) prior to query processing. In
our work, latency is the interval between the arrival of a new event and the
identification of its top-k effect events. However, as the time for the EPN update
is insignificant (see the polynomial runtime in Section 4) compared to the
time for query processing (which is exponential), latency is essentially the
query processing time.
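The two accuracy measures can be computed as in the following sketch. It is an illustration of the formulas above, not the evaluation code used in the experiments; the function name, the input layout (one ranked list per test), and the sample scores are assumptions.

```python
def accuracies(predictions, observed):
    """predictions[i] is the ranked list [(event_type, score), ...] for test i;
    observed[i] is the event type that actually occurred next.
    Returns (hit-or-miss accuracy, weighted accuracy)."""
    n_tests = len(observed)
    n_hits = 0
    weighted_sum = 0.0
    for ranked, actual in zip(predictions, observed):
        scores = dict(ranked)
        if actual not in scores:
            continue  # a miss contributes 0 to both measures
        n_hits += 1
        top = max(scores.values())
        # Normalize by the highest score in the ranked list.
        weighted_sum += scores[actual] / top if top > 0 else 0.0
    return n_hits / n_tests, weighted_sum / n_tests

# Hypothetical test run: the same top-2 list is predicted three times.
ranked = [("E4", 0.5), ("E1", 0.333)]
hm, weighted = accuracies([ranked, ranked, ranked], ["E4", "E1", "E7"])
```

With a single-entry ranked list (k = 1), every hit is normalized to 1, so the two measures coincide.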


6.1.2 Datasets 

Experiments are conducted using two real-world datasets (summary in Table 2) to
evaluate the proposed algorithms.

Dataset      N     N_instances   CRA           N_CRA
MSNBC        17    4698795       session id    989818
Power grid   565   94339         blackout id   4492

(N_CRA is the number of different CRA values. N_instances is the number of event
instances in the dataset.)

Table 2: Profiles of the datasets.

Electric power grid dataset This dataset contains simulated temporal sequences
of cascading electric power grid component outages, such as those that can lead to
very large blackouts (e.g., [32]). The sequences were generated using a model of
the Polish power network, which is described in [12]. Each sequence represents the
order in which the components failed, as well as the time of the failure. Each grid
component is considered an event type, whereas a component failure is an event
instance. This dataset includes 4492 cascade sequences and 565 distinct event
types.

In the original dataset, each file, representing one blackout, has a list of the
components that failed in that blackout. The original schema of the power grid
data is as follows: (event indicator, timestamp, component id). The event indicator
can be one of 0, 1, and -1, which refer to an initiating event, a dependent event,
and a stop event, respectively. There is always at least one initiating event at the
beginning of each component failure sequence (with 0 as its starting time). Since
these events are always at the beginning of the sequence, there is no inward edge 
towards them in the event precedence network. A dependent event is the result of 
the initiating event or another dependent event. A blackout sequence always has at 
least one dependent event. We treat both an initiating event and a dependent event 
in the same way. On the other hand, a stop event denotes the end of the blackout 
and is not a real event. Therefore, we ignore stop events. The timestamp and the 
component id are respectively the starting time of an event and the attribute that 
uniquely identifies a grid component. 

To create an event stream, we modify the schema and mix the data from the files
in random order while preserving the temporal order of the component failures in
each blackout. The modified schema, (timestamp, component id, blackout id, event
indicator), has an additional tag, blackout id, to identify the blackout to which the
component failure belongs. So, the blackout id is the CRA in the power grid
dataset.

MSNBC.com web dataset This dataset consists of click-stream data of 989818
sequences obtained from the University of California, Irvine's machine learning
repository [18]. Each sequence reflects the browsing activities, arranged in
temporal order, in one user session. The dataset gives a random sample of the length
of visits of users browsing the msnbc.com web site on the whole day of September
28, 1999. The length of the visit is an estimate of the total number of clicks or
pages seen by each user and is based on the "Internet Information Server (IIS) logs
for msnbc.com and news-related portions of msn.com". A webpage category is an
event type, and a webpage visit is an event instance. The session id of the visit is
the CRA for its event instance.

The number of distinct event types is 17. That is, a sequence can have web
activities related to 17 different webpage categories. These event types (i.e., webpage
categories) are frontpage, news, technology, local, opinion, on-air, miscellaneous,
weather, msn-news, health, living, business, msn-sports, sports, summary, bbs, and
travel. The total number of event instances (i.e., page visits) is 4698795.

To create an event stream, we randomly mix the events while preserving the
temporal order of the events for each session. The schema of the events is
(timestamp, webpage category, session id, $\emptyset$), where $\emptyset$ denotes an
empty attribute set.
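One possible way to mix sequences into a single stream while preserving the order within each sequence is sketched below. The paper does not specify the exact mixing procedure, so this is only an assumption-laden illustration: the function name, the seeded random choice among sequences, and the toy data are all hypothetical.

```python
import random

def interleave(sequences, seed=0):
    """sequences maps a CRA value (e.g. session id) to its ordered event list.
    Returns one stream of (cra, event) pairs in which the events of each
    sequence keep their original relative order."""
    rng = random.Random(seed)
    cursor = {cra: 0 for cra in sequences}
    remaining = [cra for cra, events in sequences.items() if events]
    stream = []
    while remaining:
        cra = rng.choice(remaining)          # pick any unfinished sequence
        stream.append((cra, sequences[cra][cursor[cra]]))
        cursor[cra] += 1
        if cursor[cra] == len(sequences[cra]):
            remaining.remove(cra)
    return stream

# Toy sessions using category names from the MSNBC dataset.
sessions = {"s1": ["frontpage", "news", "sports"],
            "s2": ["weather", "local"],
            "s3": ["frontpage"]}
stream = interleave(sessions, seed=42)
```

Tagging each event with its CRA value lets the stream be partitioned back into sessions or blackouts downstream.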

6.1.3 Experiment Platform 

The experiments are conducted on a 2.3 GHz Intel Core i5 machine with 4GB of 
memory, running Windows 7. The algorithms are implemented in Java 1.7.0. 

6.2 Experiment Results 

We perform two sets of experiments to evaluate the RSET and ES algorithms.
One set of experiments is to evaluate the prediction accuracy, and the other set
of experiments is to evaluate the runtime. There are two objectives in each set
of experiments. The first objective is to compare the query processing with the
run-time causal inference mechanism (of RSET and ES) against the query
processing with the traditional causal inference mechanism (of the ECNI algorithm).
For a fair comparison, as the goal is only to compare the causal inference
mechanisms, the query processing mechanism of either the RSET algorithm or the ES
algorithm can be used in the ECNI algorithm, and, in this experiment, we choose the
RSET algorithm. The second objective is to compare the query processing mechanisms
of the RSET algorithm and the ES algorithm. In addition, the effects of $k$ on
the proposed algorithms are studied in each set of experiments.

We divide each real dataset into 70% for training and 30% for testing the
proposed algorithms. The division of the MSNBC dataset is done by session id, and the
division of the electric power grid dataset is by blackout id. From the event stream
of the training dataset, the EPN is constructed as an input to the RSET and ES
algorithms, whereas a causal network is constructed as an input to the ECNI
algorithm. The window observation period T is set to 10 msec for the MSNBC dataset
and 10 sec for the electric power grid dataset. The testing data simulates an event
stream. As soon as a new event arrives, it is added to the partitioned window. Then,
the top-k prediction query execution is triggered in response to the most recent
event at position $\delta$ (called the EOP index) in the sequence of cause events in
the same partition. Note that the RSET and ES algorithms perform query processing
over the EPN whereas the ECNI algorithm does so over the causal network. Upon the
arrival of a new event, the measurements of prediction accuracy and runtime are
repeated and the calculated average accuracy and average runtime are reported.

6.2.1 Accuracy 

Figures 8 and 9, respectively, show the hit-or-miss and the weighted accuracies for
the MSNBC dataset, and Figures 10 and 11 show them for the power grid dataset.
In these figures, the accuracies of the RSET algorithm, the ES algorithm, and the
ECNI algorithm are compared for different values of the EOP index ($\delta$) over the
sequence of events in the condition set. In addition, Figures 12 and 13 show the
average hit-or-miss and weighted accuracies for different values of $k$ in the MSNBC
dataset and the power grid dataset, respectively.

As expected, for all cases, the hit-or-miss accuracy is never lower than the
weighted accuracy. Clearly, a hit in the hit-or-miss accuracy always receives the
score of 1 while a hit in the weighted accuracy receives a score lower than 1 unless
the observed event type in the test data has the highest score in the ranked list.
When $k$ is one, the size of the result ranked list is one, hence Equation 3 reduces
to Equation 2 and therefore the two accuracy measures give the same value.

Comparison of the causal inference mechanisms 

All results show that the prediction accuracies of both the ES and RSET algorithms
are significantly higher than that of the ECNI algorithm at every EOP index
($\delta$). This difference in accuracy comes from the difference in their causal
models. (Recall that for fairness the ECNI algorithm uses the same query processing
mechanism used in the RSET algorithm.) Thus, it confirms the expectation that the
traditional causal model (of the ECNI algorithm) is so limited, due to its lack of
support for cyclic causality and the loss of causal information, that the prediction
accuracy is compromised significantly. On the other hand, the RSET and ES algorithms
both use run-time causal inference, which can handle cyclic causality and avoids
causal information loss, thereby achieving higher prediction accuracies.

The results also show that the accuracy of all three algorithms increases as $k$
increases. The reason for this is that more event types are considered as the
probable next effects. Moreover, the accuracy of the ES and the RSET algorithms
is always higher than that of the ECNI algorithm. This indicates that the deficiencies
of the ECNI algorithm (i.e., acyclic causality and causal information loss) lead
to the exclusion of many important causal relationships and, as a result, its
accuracy is always much lower than that of the ES and the RSET algorithms regardless
of the value of $k$.

Figure 8: Hit-or-miss accuracies of the RSET, ES, and ECNI algorithms w.r.t. the EOP
index ($\delta$) for the MSNBC dataset. Panels shown include (c) k = 5, (d) k = 7, and
(e) k = 9. (Note that the EOP index $\delta$ is the position of the most recent event
in the sequence of cause events in the same partition.)

Figure 9: Weighted accuracies of the RSET, ES, and ECNI algorithms w.r.t. the EOP
index ($\delta$) for the MSNBC dataset. Panels shown include (c) k = 5, (d) k = 7, and
(e) k = 9.


Comparison of the query processing mechanisms 

Three observations are made from the results. First, all results show that the
prediction accuracy of the ES algorithm is always higher than that of the RSET
algorithm. This is evident from the fact that the ES algorithm performs an
exhaustive search whereas the RSET algorithm performs only a partial search.

Figure 10: Hit-or-miss accuracies of the RSET, ES, and ECNI algorithms w.r.t. EOF
index (δ) for the power grid dataset. Panel (e) k = 20.

Second, the accuracy of the RSET algorithm is closer to that of the ES
algorithm in the power grid dataset than in the MSNBC dataset. This can be
explained as follows. When the ratio of k to N is larger, both algorithms have
higher probabilities of making correct predictions, but the gap between their
accuracies is larger because ES performs an exhaustive search while RSET
performs only a partial search. Note that k/N is smaller in the power grid
dataset (the largest k/N considered is 0.035) than in the MSNBC dataset (the
largest k/N considered is 0.53), and therefore the gap is smaller for the power
grid dataset.
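
As a rough point of reference for the role of k/N (an illustrative chance-level
baseline, not a result reported in this paper), consider a hypothetical
predictor that picks k of the N event types uniformly at random. The probability
that the true next event type is among the k picks is exactly k/N:

```latex
P(\text{hit}) \;=\; \frac{\binom{N-1}{k-1}}{\binom{N}{k}} \;=\; \frac{k}{N}
```

With the largest ratios quoted above, this chance level is about 0.035 for the
power grid dataset and about 0.53 for the MSNBC dataset, which is consistent
with correct predictions becoming easier for any algorithm as k/N grows.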

Figure 11: Weighted accuracies of the RSET, ES, and ECNI algorithms w.r.t. EOF
index (δ) for the power grid dataset. Panel (e) k = 20.

Third, as discussed earlier, as the value of k increases, the accuracies of all
three algorithms increase. Here we add further evaluation of the ES algorithm and
the RSET algorithm with a focus on their search mechanisms. As k increases,
the search space of the RSET algorithm increases, leading to a higher gain in
accuracy. In the case of the ES algorithm, the search space remains constant
regardless of k, but the number of candidate effects from which the highest k is
selected increases and, consequently, the accuracy still increases. Intuitively,
the rate of increase in accuracy is higher for the hit-or-miss accuracy metric
than for the weighted accuracy metric in both algorithms.

Figure 12: Accuracies of the RSET, ES, and ECNI algorithms w.r.t. k for the MSNBC
dataset.

Figure 13: Accuracies of the RSET, ES, and ECNI algorithms w.r.t. k for the power
grid dataset.

6.2.2 Runtime 

In this experiment, we compare the runtime among the three algorithms (RSET, 
ES, ECNI). In addition, we analyze the effects of k on the runtime. 

Figures 14 and 15 show the runtime for the MSNBC dataset and the power
grid dataset, respectively. In these figures, the runtimes of the three
algorithms are compared for different values of the EOF index (δ) over the
sequence of events in the condition set. In addition, Figure 16 shows the
runtime for different values of k in the MSNBC and the power grid datasets.

Figure 14: Runtime of the RSET, ES, and ECNI algorithms w.r.t. EOF index (δ) for
the MSNBC dataset. Panel (e) k = 9.

Comparison of the causal inference mechanisms

The results show that the runtimes of the RSET and ES algorithms are longer than
that of the ECNI algorithm at every EOF index for every value of k. As discussed
in Section 5.3, the RSET and ES algorithms incur the overhead of run-time
causal inference during query processing, whereas the ECNI algorithm does not,
as it uses a pre-built causal network for prediction. Therefore, the runtimes of
the ES and RSET algorithms are always longer than that of the ECNI algorithm.
Interestingly, the runtimes of all three algorithms are longer for the power
grid dataset than for the MSNBC dataset. This is due to the larger number of
event types (i.e., N) in the power grid dataset.

Figure 15: Runtime of the RSET, ES, and ECNI algorithms w.r.t. EOF index (δ) for
the power grid dataset. Panels (a) k = 1, (b) k = 5, (c) k = 10, (d) k = 15,
(e) k = 20.

Figure 16: Runtime of the RSET, ES, and ECNI algorithms w.r.t. k. Panels
(a) MSNBC, (b) Power grid.


Comparison of the query processing mechanisms

The results suggest three observations. First, as expected, the runtime of the
RSET algorithm is always shorter than that of the ES algorithm. The main reason
is the different search scope (i.e., exhaustive in ES and partial in RSET), as
discussed in Section 5. In addition, there is an overhead in the ES algorithm,
unlike the RSET algorithm, to calculate the causal search order.

Second, the runtime difference between the RSET algorithm and the ES
algorithm is smaller for the MSNBC dataset (Figure 14) than for the power grid
dataset (Figure 15). This is due to the larger network size (i.e., N) in the
power grid dataset. That is, a larger network size results in a larger search
space (which is typical over event streams) and thus requires a longer runtime
for query processing.

Third, the runtime of the ES algorithm does not change with an increase in k. 
The ES algorithm always runs an exhaustive search, irrespective of the value of 
k, and uses k only to select the top-k event types out of the N event types at the
end, which has an insignificant effect on the overall runtime. On the other hand, the
runtime of the RSET algorithm does increase with an increase in k. The search 
space covered by the RSET algorithm increases with k, leading to the increased 
runtime. 
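
The contrast between the two search strategies can be illustrated with a toy
Python sketch. It is not the ES or RSET algorithm: the early-terminating
variant below is a generic bound-based scheme, and the candidate names, upper
bounds, and scores are invented for illustration. Unlike RSET, this toy variant
never loses accuracy, because it stops only when the bounds show that no
unvisited candidate can still enter the top k. It is meant only to show why the
number of evaluated candidates is constant in k for an exhaustive search and
grows with k for a search that terminates early.

```python
import heapq
from typing import List, Tuple

# Hypothetical candidate: (name, upper_bound, exact_score), exact_score <= upper_bound.
Candidate = Tuple[str, float, float]


def exhaustive_top_k(cands: List[Candidate], k: int) -> Tuple[List[str], int]:
    """Evaluate all N candidates, then select the k best at the end."""
    evaluated = len(cands)  # independent of k
    best = heapq.nlargest(k, cands, key=lambda c: c[2])
    return [name for name, _, _ in best], evaluated


def early_terminating_top_k(cands: List[Candidate], k: int) -> Tuple[List[str], int]:
    """Visit candidates by descending upper bound and stop as soon as no
    unvisited candidate can still enter the current top k."""
    if k <= 0:
        return [], 0
    heap: List[Tuple[float, str]] = []  # min-heap of the best exact scores so far
    evaluated = 0
    for name, bound, score in sorted(cands, key=lambda c: c[1], reverse=True):
        if len(heap) == k and heap[0][0] >= bound:
            break  # all remaining bounds are <= the current k-th best score
        evaluated += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, name))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, name))
    ranked = sorted(heap, reverse=True)
    return [name for _, name in ranked], evaluated


# Made-up data: six event types with illustrative bounds and scores.
CANDIDATES: List[Candidate] = [
    ("A", 0.90, 0.80), ("B", 0.70, 0.65), ("C", 0.60, 0.30),
    ("D", 0.50, 0.45), ("E", 0.20, 0.10), ("F", 0.10, 0.05),
]

for k in (1, 2, 3):
    print(k, exhaustive_top_k(CANDIDATES, k), early_terminating_top_k(CANDIDATES, k))
```

On this toy input the exhaustive variant always evaluates all six candidates,
while the early-terminating variant evaluates one, two, and four candidates for
k = 1, 2, and 3, respectively.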

6.2.3 Discussion of experiment results 

The proposed run-time causal inference mechanism, in the RSET and ES
algorithms, handles cyclic causality and avoids causal information loss, and thus
improves prediction accuracy significantly. The ECNI algorithm, on the other hand,
performs worse than both the RSET and ES algorithms as it cannot handle cyclic
causality. The ratios of the number of cycles to the number of edges in the EPN
are 0.69 and 0.85 for the power grid dataset and the MSNBC dataset, respectively.
Intuitively, the accuracy of the ECNI algorithm would suffer increasingly
as the number of cycles increases. Thus, despite its shorter runtime, the ECNI
algorithm is not suitable for top-k predictive query processing over event streams.

The accuracy of the RSET algorithm is comparable to the accuracy of the ES
algorithm. Therefore, the RSET algorithm, as it is much faster than the ES
algorithm, is more suitable for real-time continuous top-k query processing over
event streams, whereas the ES algorithm is more suitable when time is of lesser
importance. This will become increasingly evident for real-time applications
with hundreds (or possibly thousands) of event types, because the pruning effect
of reduced search and early termination then becomes increasingly significant.

Between the two datasets, the runtime for the power grid dataset is much longer
than the runtime for the MSNBC dataset. This is due to the difference in the
numbers of event types in these two datasets. That is, the much larger number of
event types in the power grid dataset leads to many more conditional independence
tests during the run-time causal inference, thus resulting in slower query
execution. In our work, the runtime measurements were made on a low-end laptop.
The use of a more powerful computational setup (e.g., parallel processing) would
further reduce the runtime.

7 Conclusion and Future Work 

This paper focused on the problem of continuous top-k prediction over event
streams. We proposed a novel causal inference mechanism, called run-time causal
inference, to support cyclic causal relationships and overcome causal
information loss. Then, we proposed two query processing algorithms, called the
Reduced Search with Early Termination (RSET) algorithm and the Exhaustive Search
(ES) algorithm, which use run-time causal inference to predict top-k effects
continuously. Finally, through experiments, we demonstrated that the proposed
approach overcomes the two main limitations of the traditional causal inference
approach: acyclic causality and causal information loss. We showed that the
proposed RSET and ES algorithms greatly improve the causal inference power for
real data encountered in various applications today.

There are a number of issues we plan to address in future work. First, in this
paper we assume that events in a stream arrive in the correct temporal order.
If, however, events arrive out of order, erroneous relationships are introduced
in the event precedence network, thereby degrading the accuracy of prediction.
One idea to deal with such an out-of-order stream is to allow for ambiguity in
the edge direction by introducing, for example, undirected edges and then
allowing the algorithm to resolve edge directions at query processing time.
Second, in this paper we only support direct causality and, therefore, one level
of prediction, under the assumption that an event is the most likely effect of
only its immediately preceding event. Extending this to support indirect
causality between events, and thus multiple levels of causal prediction through
a chain of intermediate events, would be interesting.
Since the mechanism to compute the propagation of probabilities through the event
precedence network is already in place (see Examples |5] and this extension is not 
conceptually difficult. However, the computational cost, which increases
exponentially with the number of prediction levels, may be a challenge. Third, in
this paper event types are assumed to be provided by domain experts. Some
applications may require that the EPN constructor automatically identify and
define event types from event streams.
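
To illustrate the first issue above (out-of-order arrival), the following is a
minimal Python sketch. It assumes a simplified precedence model that only counts
consecutive event-type pairs in arrival order; the function names and the
(timestamp, type) tuple layout are hypothetical and are not the paper's EPN
implementation. It shows how one late event produces erroneous directed edges,
and how treating edges as undirected keeps the affected pair while leaving its
direction open.

```python
from collections import Counter
from typing import FrozenSet, Iterable, List, Optional, Set, Tuple

# Hypothetical event representation: (timestamp, event type).
Event = Tuple[int, str]


def precedence_counts(stream: Iterable[Event]) -> Counter:
    """Count first-order precedence edges (previous type -> next type),
    following the order in which events arrive."""
    counts: Counter = Counter()
    prev: Optional[str] = None
    for _timestamp, etype in stream:
        if prev is not None:
            counts[(prev, etype)] += 1
        prev = etype
    return counts


def undirected_edges(counts: Counter) -> Set[FrozenSet[str]]:
    """Drop edge directions, leaving them to be resolved later."""
    return {frozenset(edge) for edge in counts}


in_order: List[Event] = [(1, "A"), (2, "B"), (3, "C")]
out_of_order: List[Event] = [(1, "A"), (3, "C"), (2, "B")]  # B arrives late

good = precedence_counts(in_order)
bad = precedence_counts(out_of_order)

print(sorted(good))  # [('A', 'B'), ('B', 'C')]
print(sorted(bad))   # [('A', 'C'), ('C', 'B')]
print(frozenset({"B", "C"}) in undirected_edges(bad))  # True
```

In this toy stream the late arrival of B reverses the B-to-C edge and adds a
spurious A-to-C edge. The undirected view still contains the pair {B, C}, so
its direction could be decided later, although the spurious pair {A, C} remains.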